Abstract

We propose a new learning-based method for estimating 2D human pose from a single image, using Dual-Source Deep Convolutional Neural Networks (DS-CNN). Recently, many methods have been developed to estimate human pose by using pose priors that are estimated from physiologically inspired graphical models or learned from a holistic perspective. In this paper, we propose to integrate both the local (body) part appearance and the holistic view of each local part for more accurate human pose estimation. Specifically, the proposed DS-CNN takes a set of image patches (category-independent object proposals for training and multi-scale sliding windows for testing) as the input and then learns the appearance of each local part by considering their holistic views in the full body. Using DS-CNN, we achieve both joint detection, which determines whether an image patch contains a body joint, and joint localization, which finds the exact location of the joint in the image patch. Finally, we develop an algorithm to combine these joint detection/localization results from all the image patches for estimating the human pose. The experimental results show the effectiveness of the proposed method in comparison with state-of-the-art human pose estimation methods based on pose priors that are estimated from physiologically inspired graphical models or learned from a holistic perspective.


1. Introduction 

By accurately locating the important body joints from 2D images, human pose estimation plays an essential role in computer vision. It has wide applications in intelligent surveillance, video-based action recognition, and human-computer interaction. However, human pose estimation from a 2D image is a well-known challenging problem: too many degrees of freedom are introduced by the large variability of the human pose, the different visual appearance of the human body and joints, different camera view angles, and possible occlusions of body parts and joints.

Figure 1. An illustration of the proposed method based on DS-CNN. (a) Input image and generated image patches, (b) DS-CNN input on an image patch (containing a local part, here the ankle), (c) DS-CNN input on the full body and the holistic view of the local part in the full body, (d) DS-CNN for learning, (e) DS-CNN output on joint detection, (f) DS-CNN output on joint localization.

Most of the previous works on human pose estimation are based on the two-layer part-based model [?]. The first layer focuses on local (body) part appearance and the second layer imposes the contextual relations between local parts. One popular part-based approach is pictorial structures [?], which capture the pairwise geometric relations between adjacent parts using a tree model. However, pose estimation methods using part-based models are usually sensitive to noise, and the graphical model lacks the expressiveness to model complex human poses [?]. Furthermore, most of these methods search for each local part independently, and the local appearance may not be sufficiently discriminative for identifying each local part reliably.

Recently, deep neural network architectures, specifically deep convolutional neural networks (CNNs), have shown outstanding performance in many computer vision tasks. Due to CNNs' large learning capacity and robustness to variations, there is a natural rise in the interest to directly learn high-level representations of human poses without using hand-crafted low-level features or even graphical models. Toshev et al. [?] present such a holistic-styled pose estimation method named DeepPose using DNN-based joint regressors. This method also uses a two-layer architecture: the first layer resolves ambiguity between body parts (e.g., left and right legs) in a holistic way and provides an initial pose estimation, and the second layer refines the joint locations in a local neighborhood around the initial estimation. In the experiments reported in [?], DeepPose achieves better performance on two widely used datasets, FLIC and LSP, than several recently developed human pose estimation methods. However, DeepPose does not consider local part appearance in the initial pose estimation. As a result, it has difficulty in estimating complex human poses, even with the CNN architecture.

In this paper, we propose a dual-source CNN (DS-CNN) based method for human pose estimation, as illustrated in Fig. 1. The proposed method integrates both the local part appearance in image patches and the holistic view of each local part for more accurate human pose estimation. Following the region-CNN (R-CNN) that was developed for object detection [17], the proposed DS-CNN takes a set of category-independent object proposals detected from the input image for training. Compared to the sliding windows or the full image that are used as the input in many previous human pose estimation methods, object proposals can capture the local body parts with better semantic meanings in multiple scales [17, 18]. In this paper, we extend the original single-source R-CNN to a dual-source model (DS-CNN) by including the full body and the holistic view of the local parts as a separate input, which provides a holistic view for human pose estimation. By taking both the local part object proposals and the full body as inputs in the training stage, the proposed DS-CNN performs a unified learning to achieve both joint detection, which determines whether an object proposal contains a body joint, and joint localization, which finds the exact location of the joint in the object proposal. In the testing stage, we use multi-scale sliding windows to provide local part information in order to avoid the performance degradation caused by the uneven distribution of object proposals. Based on the DS-CNN outputs, we combine the joint detection results from all the sliding windows to construct a heatmap that reflects the joint location likelihood at each pixel, and we take a weighted average of the joint localization results at the high-likelihood regions of the heatmap to obtain the final estimate of each joint location.

In the experiments, we test the proposed method on two widely used datasets and compare its performance to several recently reported human pose estimation methods, including DeepPose. The results show the effectiveness of the proposed method, which combines local appearance and holistic view.


2. Related Work 

Part-based models for human pose estimation. In part-based models, the human body is represented by a collection of physiologically inspired parts assembled through a deformable configuration. Following the pictorial-structure model [?], a variety of part-based methods have been developed for human pose estimation [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. While many early methods build appearance models for each local part independently, recent works [?] attempt to design strong body part detectors by capturing the contextual relations between body parts. Johnson and Everingham [?] partition the pose space into a set of pose clusters and then apply nonlinear classifiers to learn pose-specific part appearance. In [12], independent regressors are trained for each joint and the results from these regressors are combined to estimate the likelihood of each joint at each pixel of the image. Based on the appearance models built for each part, these methods usually leverage tree-structured graphical models to further impose the pairwise geometric constraints between parts [2, 3, 8, ?, 21]. Due to their limited expressiveness [?], tree-structured graphical models often suffer from limb ambiguity, which affects the accuracy of human pose estimation. Several works focus on designing richer graphical models to overcome this limitation. For example, in [10], mixtures of pictorial structure models are learned to capture the 'multi-modal' appearance of each body part. Yang and Ramanan [?] introduce a flexible mixture-of-parts model to capture contextual co-occurrence relations between parts. In [?], a hierarchical structure is incorporated to model high-order spatial relations among parts. Loopy models [5, 22, 23, 24] allow additional part constraints to be included, but require approximate inference. In the experiments reported later, we include several of the above-mentioned part-based methods for performance comparison.

Deep convolutional neural network (CNN) in computer vision. As a popular deep learning approach, CNN [25] attempts to learn multiple levels of representation and abstraction and then uses them to model complex non-linear relations. It has been shown to be a useful tool in many computer vision applications. For example, it has demonstrated impressive performance for image classification [26, 27, 28, 29]. More recently, CNN architectures have been successfully applied to object localization and detection [30, 17, 31]. In [31], a single shared CNN named 'Overfeat' is used to simultaneously classify, locate and detect objects from an image by examining every sliding window. In this paper, we also integrate joint detection and localization using a single DS-CNN. But our problem is much more challenging than object detection: we need to find the precise locations of a set of joints for human pose estimation. Girshick et al. [17] apply high-capacity R-CNNs to bottom-up object proposals [?] for object localization and segmentation. This achieves a 30% performance improvement on PASCAL VOC 2012 against the state of the art. Zhang et al. [18] adapt the R-CNN [17] to part localization and verify that the use of object proposals instead of sliding windows in CNN can help localize smaller parts. Based on this, R-CNN is shown to be effective for fine-grained category detection. However, this method does not consider the complex relations between different parts [?] and is not applicable to human pose estimation.

CNN for human pose estimation. In [?], a cascade of CNN-based joint regressors is applied to reason about pose in a holistic manner, and the developed method was named 'DeepPose'. The DeepPose networks take the full image as the input and output the ultimate human pose without using any explicit graphical model or part detectors. In [33], Jain et al. introduce a CNN-based architecture and a learning technique that learns low-level features and a higher-level weak spatial model. Following [33], Tompson et al. show that the inclusion of an MRF-based graphical model into the CNN-based part detector can substantially increase the human pose estimation performance. Different from DeepPose and Tompson et al. [34], the proposed method takes both the object proposals and the full body as the input for training, instead of using sliding-window patches, to capture the local body parts with better semantic meanings in multiple scales.

3. Problem Description and Notations 

In this paper, we adopt the following notations. A human pose can be represented by a set of human joints $J = \{j_i\}_{i=1}^{L}$, where $j_i = (x_i, y_i)^T$ denotes the 2D coordinate of joint $i$ and $L$ is the number of human joints. We are interested in estimating the 2D joint locations $J$ from a single image $I$. Since our detection and regression are applied to a set of image patches, in the form of rectangular bounding boxes, detected in $I$, it is necessary to convert absolute joint coordinates in image $I$ to relative joint coordinates in an image patch. Furthermore, we introduce a normalization to make the locations invariant to the size of different image patches. Specifically, given an image patch $p$, the location of $p$ is represented by the 4-element vector $p = (w(p), h(p), c(p))^T$, where $w(p)$ and $h(p)$ are the width and height of $p$, and $c(p) = (x_c(p), y_c(p))^T$ is the center of $p$. Then the normalized coordinate of joint $j_i$ relative to $p$ can be denoted as

$$ j_i(p) = (x_i(p), y_i(p))^T = \left( \frac{x_i - x_c(p)}{w(p)}, \frac{y_i - y_c(p)}{h(p)} \right)^T. \tag{1} $$

Figure 2. Extended part and body patches containing (a) right ankle, (b) left ankle, (c) right wrist, and (d) left wrist from the LSP training dataset. For each local part, the part patches are shown on the left while the corresponding body patches are shown on the right.

Furthermore, the visibility of all the joints in $p$ is denoted as $V(p) = \{v_i(p)\}_{i=1}^{L}$, where

$$ v_i(p) = \begin{cases} 1, & |x_i(p)| < 0.5 \text{ and } |y_i(p)| < 0.5 \\ 0, & \text{otherwise.} \end{cases} \tag{2} $$

If $v_i(p) = 1$, joint $i$ is visible in $p$, i.e., it is located inside the patch $p$. On the contrary, if $v_i(p) = 0$, joint $i$ is invisible in $p$, i.e., it is located outside of $p$.
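The normalization and the visibility test above can be illustrated with a minimal Python sketch. The function names and the tuple layout `(w, h, xc, yc)` for a patch are assumptions made for this illustration only.

```python
def normalize_joint(joint, patch):
    """Normalized joint coordinate relative to a patch (Eq. 1).

    joint: (x, y) in absolute image coordinates.
    patch: (w, h, xc, yc), i.e. width, height and centre (assumed layout).
    """
    x, y = joint
    w, h, xc, yc = patch
    return ((x - xc) / w, (y - yc) / h)


def is_visible(norm_joint):
    """Visibility indicator (Eq. 2): 1 if the joint lies inside the patch."""
    nx, ny = norm_joint
    return 1 if abs(nx) < 0.5 and abs(ny) < 0.5 else 0


# A 40 x 20 patch centred at (100, 50).
patch = (40.0, 20.0, 100.0, 50.0)
inside = normalize_joint((110.0, 45.0), patch)   # joint inside the patch
outside = normalize_joint((130.0, 45.0), patch)  # joint to the right of it
```

Because the coordinates are divided by the patch width and height, a joint on the patch border always has a normalized coordinate of magnitude 0.5, regardless of the patch size.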


4. Model Inputs 

As described earlier, to combine the local part appearance and the holistic view of each part, the proposed DS-CNN takes two inputs for training and testing: image patches and the full body. To make it clearer, we call the former input the part patches, denoted as $p_p$, and the latter body patches, denoted as $p_b$. So the dual-source input is $p_{p,b} = (p_p, p_b)$. Randomly selected samples for these two kinds of inputs are shown in Fig. 2, where for each local part, the part patches are shown on the left while the corresponding body patches are shown on the right. From these samples, we can see that it is difficult to distinguish the left and right wrists, or some wrists and legs, based only on the local appearance in the part patches.

As we will see later, we use object proposals detected from an image for training, and object proposals usually have different sizes and different aspect ratios. CNN requires the input to be of a fixed dimension. In [17], all the object proposals are non-uniformly scaled to a fixed-size square, which may change their original aspect ratios. This may complicate the CNN training by artificially introducing unrealistic patterns into training samples. In particular, in our model we are only interested in the body joint that is closest to the center of a part patch (this will be elaborated in detail later). If the part patch is non-uniformly scaled, the joint of interest may be different after the change of the aspect ratio. Thus, in this paper we keep the aspect ratio of image patches unchanged when unifying their size. Specifically, we extend the short side of the image patch to include additional rows or columns to make it a square. This extension is conducted in such a way that the center of each image patch remains unchanged. After the extension, we can perform uniform scaling to make each patch a fixed-size square. This extension may not be preferred in object detection [?] because it includes undesired background information. However, in our problem this extension just includes more context information of the joint of interest. This will not introduce much negative effect to the part detection. The only minor effect is a subtle reduction of the resolution of each patch (after the uniform scaling).
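The center-preserving extension followed by uniform scaling can be sketched as follows. This is a minimal illustration, not the authors' implementation; the patch layout `(w, h, xc, yc)` and the target side length used in the example are assumptions.

```python
def extend_to_square(patch):
    """Extend the short side of a patch so it becomes a square.

    patch: (w, h, xc, yc). The centre (xc, yc) is left unchanged, so the
    joint closest to the centre stays the joint of interest.
    """
    w, h, xc, yc = patch
    side = max(w, h)
    return (side, side, xc, yc)


def uniform_scale_factor(patch, target_side):
    """Single scale factor that maps the extended square to target_side.

    target_side is a hypothetical fixed input size; the same factor is
    applied to both axes, so the aspect ratio of the content is preserved.
    """
    side, _, _, _ = extend_to_square(patch)
    return target_side / side


square = extend_to_square((40.0, 20.0, 100.0, 50.0))
factor = uniform_scale_factor((40.0, 20.0, 100.0, 50.0), 80.0)
```

A factor below 1 corresponds to the resolution reduction mentioned above: the wider the extension, the fewer pixels remain for the original patch content.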

Part Patches. In the training stage, we construct part patches in two steps. 1) Run an algorithm to construct a set of category-independent object proposals. Any existing object proposal algorithm can be used here. In our experiments, we use the algorithm developed in [?]. 2) Select a subset of the constructed proposals as the part patches. We consider two factors for Step 2). First, we only select object proposals with a size in a certain range as part patches. If the size of an object proposal is too large, it may cover multiple body parts and its appearance lacks sufficient resolution (after the above-mentioned uniform scaling) for joint detection and localization. On the contrary, if the size of an object proposal is too small, its appearance may not provide sufficient features. To address this issue, we only select the object proposals $p_p$ with an area in a specified range as part patches, i.e.,

$$ \mu_1 d^2(J) < w(p_p) \cdot h(p_p) < \mu_2 d^2(J), \tag{3} $$

where $d(J)$ is the distance between two opposing joints on the human torso [?]. $\mu_1$ and $\mu_2$ are two coefficients ($\mu_1 < \mu_2$) that help define the lower bound and the upper bound for selecting an object proposal as a part patch.
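The area-based selection of part patches can be sketched in a few lines of Python. The sketch assumes that both bounds scale with the squared torso distance $d(J)$, so that they are areas; the function names and the `(w, h, xc, yc)` proposal layout are hypothetical.

```python
import math


def torso_distance(joint_a, joint_b):
    """Distance d(J) between two opposing torso joints (e.g. a shoulder and
    the opposite hip); which pair is used is an assumption of this sketch."""
    return math.hypot(joint_a[0] - joint_b[0], joint_a[1] - joint_b[1])


def select_part_patches(proposals, d, mu1=0.1, mu2=1.0):
    """Keep proposals whose area lies strictly between mu1*d^2 and mu2*d^2.

    proposals: list of (w, h, xc, yc) tuples.
    """
    lower, upper = mu1 * d * d, mu2 * d * d
    return [p for p in proposals if lower < p[0] * p[1] < upper]


d = torso_distance((0.0, 0.0), (6.0, 8.0))
proposals = [(2.0, 2.0, 5.0, 5.0),     # area 4: too small
             (5.0, 5.0, 5.0, 5.0),     # area 25: kept
             (20.0, 20.0, 5.0, 5.0)]   # area 400: too large
kept = select_part_patches(proposals, d)
```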

Second, from the training perspective, we want all the body joints to be covered by a sufficient number of part patches. In the ideal case, we expect the selected part patches to cover all the joints in a balanced way, i.e., all the joints are covered by a similar number of part patches. We empirically examine this issue and the results are shown in Fig. 3: on both FLIC and LSP datasets, this simple part-patch selection algorithm provides quite balanced coverage of all the joints. In this figure, the x-axis indicates the label of different joints; only upper-body joints are shown for the FLIC dataset while all 14 body joints are shown for the LSP dataset. The y-axis indicates the average number of part patches that cover the specified joint in each image. Here we count that a part patch covers a joint if this joint is visible to (i.e., located inside) this patch and this joint is the closest joint to the center of this patch. At each joint, we show three part-patch coverage numbers in three different colors. From left to right, they correspond to three different $\mu_2$ values of 1.0, 1.5 and 2.0, respectively. In this empirical study, we always set $\mu_1 = 0.1$.

Figure 3. The average number of part patches that cover each joint in (a) FLIC and (b) LSP datasets. The three colors indicate the results obtained with $\mu_2$ values of 1.0, 1.5 and 2.0, respectively.

In the testing stage, part patches are selected from multi-scale sliding windows (this will be justified in Section 8).

Body Patches. Similarly, in the training stage we construct body patches by selecting a subset of object proposals from the same pool of object proposals detected from the image. The only requirement is that the selected body patch should cover the whole body, i.e., all the joints:

$$ \sum_{i=1}^{L} v_i(p_b) = L. \tag{4} $$

In the testing stage, the body patch can be generated by using a human detector. For the experiments in this paper, each testing image only contains one person and we simply take the whole testing image as the body patch.

For DS-CNN, each training sample is made up of a part patch $p_p$, a body patch $p_b$, and the binary mask that specifies the location of $p_p$ in $p_b$, as shown in Fig. 1, where both $p_p$ and $p_b$ are extended and normalized to a fixed-size square image. For part patch $p_p$, we directly take the RGB values at all the pixels as the first source of input to DS-CNN. For body patch $p_b$, we take the binary mask as an additional alpha channel and concatenate the RGB values of $p_b$ and the alpha values as the second source of input to DS-CNN. Given that we extend and normalize all the patches to an $N \times N$ square, the first source of input is of dimension $3N^2$ and the second source of input is of dimension $4N^2$. In the training stage, based on the constructed part patches and body patches, we randomly select one from each as a training sample. For both training and testing, if the selected part patch is not fully contained in the selected body patch, we crop the part patch by removing the portion located outside the body patch before constructing the training or testing sample.
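The cropping step and the binary mask that marks the part patch inside the body patch can be sketched as below. For simplicity this illustration represents boxes by pixel corners `(x0, y0, x1, y1)` with exclusive upper bounds, which is an assumption of the sketch rather than the notation used in this paper, and it builds the mask at the original body-patch resolution, before any extension or scaling.

```python
def crop_to_body(part, body):
    """Remove the portion of the part patch that lies outside the body patch.

    Both boxes are (x0, y0, x1, y1) pixel corners, upper bounds exclusive.
    """
    return (max(part[0], body[0]), max(part[1], body[1]),
            min(part[2], body[2]), min(part[3], body[3]))


def part_mask(part, body):
    """Binary mask over the body patch: 1 where the (cropped) part patch lies.

    The mask plays the role of the extra alpha channel of the second input.
    """
    px0, py0, px1, py1 = crop_to_body(part, body)
    bx0, by0, bx1, by1 = body
    return [[1 if px0 <= x < px1 and py0 <= y < py1 else 0
             for x in range(bx0, bx1)]
            for y in range(by0, by1)]


body = (0, 0, 4, 3)
part = (2, 1, 6, 3)          # sticks out of the body patch on the right
cropped = crop_to_body(part, body)
mask = part_mask(part, body)
```

Stacking this mask with the three RGB channels of the body patch gives the four-channel second input described above.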

5. Multi-Task Learning 

We combine two tasks in a single DS-CNN: joint detection, which determines whether a part patch contains a body joint, and joint localization, which finds the exact location of the joint in the part patch. Each task is associated with a loss function.

Figure 4. The structure of DS-CNN.

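Joint detection. For joint detection, we label a patch-pair $p_{p,b}$ to joint $i^*(p_p)$, where

$$ i^*(p_p) = \begin{cases} \arg\min_{1 \le i \le L} \| j_i(p_p) \|^2, & \text{if } \sum_{i=1}^{L} v_i(p_p) > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{5} $$

and this is taken as the ground truth for training.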

Let the DS-CNN output for joint detection be $(\ell_0(p_{p,b}), \ell_1(p_{p,b}), \ldots, \ell_L(p_{p,b}))^T$, where $\ell_0$ indicates the likelihood of no body joint visible in $p_p$ and $\ell_i$, $i = 1, \ldots, L$, represents the likelihood that joint $i$ is visible in $p_p$ and is the closest joint to the center of $p_p$. We use a softmax classifier where the loss function is

$$ C_d(p_{p,b}) = - \sum_{i=0}^{L} 1\left( i^*(p_p) = i \right) \log\left( \ell_i(p_{p,b}) \right), \tag{6} $$

where $1(\cdot)$ is the indicator function.
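The ground-truth labeling and the softmax detection loss can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the raw network scores are assumed to be turned into likelihoods by a standard softmax, and the arg-min is taken over all $L$ joints as written in Eq. (5).

```python
import math


def detection_label(norm_joints):
    """Ground-truth label i* of a part patch (Eq. 5).

    norm_joints: list of L normalized joint coordinates (x, y) relative to
    the part patch. Returns 0 if no joint is visible, otherwise the 1-based
    index of the joint closest to the patch centre.
    """
    if not any(abs(x) < 0.5 and abs(y) < 0.5 for x, y in norm_joints):
        return 0
    sq_dist = [x * x + y * y for x, y in norm_joints]
    return 1 + min(range(len(sq_dist)), key=sq_dist.__getitem__)


def softmax(scores):
    """Numerically stable softmax over the L + 1 detection scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def detection_loss(scores, label):
    """Softmax loss (Eq. 6): negative log-likelihood of the true label."""
    return -math.log(softmax(scores)[label])


label_a = detection_label([(0.4, 0.4), (0.1, -0.1), (2.0, 0.0)])
label_b = detection_label([(0.9, 0.0), (0.0, -0.7)])
uniform_loss = detection_loss([0.0, 0.0, 0.0], label_a)
```

Only one term of the sum in Eq. (6) is non-zero for each sample, which is why the loss reduces to a single negative log-likelihood.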

Joint localization. Joint localization is formulated as a regression problem. In DS-CNN training, the ground-truth joint location for a patch-pair $p_{p,b}$ is $j_{i^*(p_p)}(p_p) = (x_{i^*(p_p)}(p_p), y_{i^*(p_p)}(p_p))^T$, where $i^*(p_p)$ is defined in Eq. (5). Let the DS-CNN output on joint localization be $\{z_i(p_{p,b})\}_{i=1}^{L}$, where $z_i(p_{p,b}) = (\hat{x}_i, \hat{y}_i)^T$ denotes the predicted location of the $i$-th joint in $p_p$. We use the mean squared error as the loss function,

$$ C_r(p_{p,b}) = \begin{cases} \| z_{i^*(p_p)}(p_{p,b}) - j_{i^*(p_p)}(p_p) \|^2, & \text{if } i^*(p_p) > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{7} $$

Combining the joint detection and joint localization, the loss function for DS-CNN is

$$ C = \sum_{p_{p,b}} \left( \lambda_d C_d(p_{p,b}) + C_r(p_{p,b}) \right), \tag{8} $$

where the summation is over all the training samples (patch pairs) and $\lambda_d > 0$ is a factor that balances the two loss functions.

6. DS-CNN Structure

The structure of the proposed DS-CNN is based on the CNN described in [?], which is made up of five convolutional layers, three fully-connected layers, and a final 1000-way softmax, in sequence. The convolutional layers 1, 2 and 5 are followed by max pooling. In the proposed DS-CNN, we include two separate sequences of convolutional layers. As shown in Fig. 4, one sequence of 5 convolutional layers takes the part-patch input as defined in Section 4 and extracts the features from local appearance. The other sequence of 5 convolutional layers takes the body-patch input and extracts the holistic features of each part. The outputs from these two sequences of convolutional layers are then concatenated and fed to a sequence of three fully-connected layers. We replace the final 1000-way softmax by an $(L + 1)$-way softmax and a regressor for joint detection and joint localization, respectively. In DS-CNN, all the convolutional layers and the fully-connected layers are shared by both the joint detection and the joint localization.

In Fig. 4, each convolutional layer and its following pooling layer are labeled $C_i$ and the fully-connected layers are labeled $F_i$, where $i$ is the index of the layer. The size of a convolutional layer is described as depth@width x height, where depth is the number of convolutional filters, and width and height denote the spatial dimensions.

7. Human Pose Estimation 

Given a testing image, we construct a set of patch-pairs using multi-scale sliding windows as discussed in Section 4. We then run the trained DS-CNN on each patch-pair to obtain both joint detection and localization. In this section, we propose an algorithm for estimating the final human pose on the testing image by combining the joint detection and localization results from all the patch pairs.

At first, we construct a heatmap $H_i$ for each joint $i$. The heatmap is of the same size as the original image, and $H_i(\mathbf{x})$, the heatmap value at a pixel $\mathbf{x}$, reflects the likelihood that joint $i$ is located at $\mathbf{x}$. Specifically, for each patch-pair $p_{p,b}$, we uniformly allocate its joint-detection likelihood to all the pixels in $p_p$, i.e.,

$$ h_i(\mathbf{x}, p_{p,b}) = \begin{cases} \ell_i(p_{p,b}) / \left( w(p_p) \cdot h(p_p) \right), & \text{if } \mathbf{x} \in p_p \text{ and } \ell_i(p_{p,b}) > \ell_j(p_{p,b}), \ \forall j \neq i \\ 0, & \text{otherwise.} \end{cases} \tag{9} $$

We then sum up the allocated joint-detection likelihood over all the patch-pairs in a testing image as

$$ H_i(\mathbf{x}) = \sum_{p_{p,b}} h_i(\mathbf{x}, p_{p,b}). \tag{10} $$

Figure [?] shows an example of the heatmap for the left wrist. We can see that, by incorporating the body patches, the constructed heatmap resolves the limb ambiguity. However, while the heatmap provides a rough estimate of the joint location, it is insufficient to accurately localize the body joints.
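The heatmap construction of Eqs. (9) and (10) can be sketched as a simple accumulation over part patches. This is an illustration under stated assumptions: part patches are given as pixel corners `(x0, y0, x1, y1)` with exclusive upper bounds, and the comparison with the other likelihoods is taken over the $L$ joints only (index 0, "no joint", is not compared).

```python
def build_heatmap(width, height, detections, joint):
    """Accumulate the heatmap H_i for one joint over all patch-pairs.

    detections: list of (box, likelihoods), where box is the part patch as
    (x0, y0, x1, y1) and likelihoods is [l_0, l_1, ..., l_L].
    joint: 1-based joint index i.
    """
    heat = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), lik in detections:
        others = [l for j, l in enumerate(lik) if j not in (0, joint)]
        if not all(lik[joint] > l for l in others):
            continue  # joint i is not the most likely joint in this patch
        # Spread the likelihood uniformly over the patch area (Eq. 9).
        share = lik[joint] / ((x1 - x0) * (y1 - y0))
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                heat[y][x] += share  # summation of Eq. (10)
    return heat


detections = [((0, 0, 2, 2), [0.1, 0.8, 0.1]),   # joint 1 wins, area 4
              ((1, 0, 3, 2), [0.2, 0.3, 0.5]),   # joint 2 wins: skipped
              ((1, 0, 3, 1), [0.0, 0.6, 0.4])]   # joint 1 wins, area 2
heat = build_heatmap(4, 2, detections, joint=1)
```

Dividing by the patch area means that a small, confident patch raises the heatmap more sharply than a large patch with the same likelihood.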

To find the accurate location of a joint, we take the DS-CNN joint-localization outputs from a selected subset of patch-pairs, where the joint is visible with high likelihood. We then take the weighted average of these selected outputs as the final location of the joint. More specifically, we only select patch pairs $p_{p,b}$ that satisfy the following conditions when finding the location of joint $i$ in the testing image.

1. The likelihood that no body joint is visible in $p_p$ is smaller than the likelihood that joint $i$ is visible in the part patch, i.e.,

$$ \ell_0(p_{p,b}) < \ell_i(p_{p,b}). \tag{11} $$

2. The likelihood that joint $i$ is visible in $p_p$ should be among the $k$ largest ones over all $L$ joints. In a special case, if we set $k = 1$, this condition requires $\ell_i(p_{p,b}) > \ell_j(p_{p,b}), \ \forall j \neq i$.

3. The maximum heatmap value (for joint $i$) in $p_p$ is close to the maximum heatmap value over the body patch (the full testing image in our experiments). Specifically, let

$$ H_i^p = \max_{\mathbf{x} \in p_p} H_i(\mathbf{x}), \tag{12} $$

and

$$ H_i^b = \max_{\mathbf{x} \in p_b} H_i(\mathbf{x}). \tag{13} $$

We require

$$ H_i^p > \lambda_h H_i^b, \tag{14} $$

where $\lambda_h$ is a scaling factor between 0 and 1. In our experiments, we set $\lambda_h = 0.9$.


Method                   Arm             Leg             Torso   Head   Avg.
                         Upper   Lower   Upper   Lower
DS-CNN                   0.80    0.63    0.90    0.88    0.98    0.85   0.84
DeepPose                 0.56    0.38    0.77    0.71    -       -      -
Dantone et al.           0.45    0.25    0.68    0.61    0.82    0.79   0.60
Tian et al.              0.52    0.33    0.70    0.60    0.96    0.88   0.66
Johnson et al. [36]      0.54    0.38    0.75    0.67    0.88    0.75   0.66
Wang et al.              0.49    0.32    0.74    0.70    0.92    0.86   0.67
Pishchulin et al. [?]    0.54    0.34    0.76    0.68    0.88    0.78   0.66
Pishchulin et al.        0.62    0.45    0.79    0.73    0.89    0.86   0.72

Table 1. PCP comparison on LSP. Note that DS-CNN, DeepPose [?] and Johnson et al. [36] are trained with both the LSP and its extension, while the other methods use only LSP.


Let $P_i$ be the set of the selected patch-pairs that satisfy these three conditions. We estimate the location of joint $i$ by

$$ j_i^* = \frac{\sum_{p_{p,b} \in P_i} \tilde{z}_i(p_{p,b}) \, \ell_i(p_{p,b})}{\sum_{p_{p,b} \in P_i} \ell_i(p_{p,b})}, \tag{15} $$

where $\tilde{z}_i(p_{p,b})$ is the DS-CNN estimated joint-$i$ location in the coordinates of the body patch $p_b$. As mentioned earlier, in our experiments, each testing image only contains one person and we simply take the whole image as the body patch $p_b$, so $\tilde{z}_i(p_{p,b})$ can be derived from the DS-CNN joint localization output $z_i(p_{p,b})$ by applying the inverse transform of Eq. (1).
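The three selection conditions and the likelihood-weighted average can be combined in one short routine. The sketch below is illustrative rather than the authors' code: the candidate layout, the pixel-corner boxes `(x0, y0, x1, y1)`, and the convention that each predicted location has already been mapped back to image coordinates are all assumptions.

```python
def estimate_joint(candidates, joint, heat, k=3, lambda_h=0.9):
    """Final location of one joint as a weighted average (Eq. 15).

    candidates: list of (likelihoods, box, location) where likelihoods is
    [l_0, ..., l_L], box is the part patch (x0, y0, x1, y1) and location is
    the predicted joint position in image coordinates.
    heat: heatmap H_i of this joint as a list of rows.
    Returns None if no patch-pair satisfies the three conditions.
    """
    h_body = max(max(row) for row in heat)            # Eq. (13), whole image
    num_x = num_y = den = 0.0
    for lik, (x0, y0, x1, y1), (zx, zy) in candidates:
        if not lik[0] < lik[joint]:                   # condition 1, Eq. (11)
            continue
        ranked = sorted(range(1, len(lik)), key=lambda j: lik[j], reverse=True)
        if joint not in ranked[:k]:                   # condition 2
            continue
        h_part = max(heat[y][x]                       # Eq. (12)
                     for y in range(y0, y1) for x in range(x0, x1))
        if not h_part > lambda_h * h_body:            # condition 3, Eq. (14)
            continue
        num_x += lik[joint] * zx
        num_y += lik[joint] * zy
        den += lik[joint]
    return None if den == 0.0 else (num_x / den, num_y / den)


heat = [[1.0, 0.2], [0.2, 0.1]]
candidates = [([0.1, 0.6, 0.3], (0, 0, 1, 1), (0.0, 0.0)),   # selected
              ([0.1, 0.2, 0.7], (0, 0, 2, 2), (9.0, 9.0)),   # fails cond. 2
              ([0.0, 0.3, 0.1], (0, 0, 2, 1), (3.0, 1.5)),   # selected
              ([0.5, 0.4, 0.1], (0, 0, 2, 2), (9.0, 9.0)),   # fails cond. 1
              ([0.0, 0.9, 0.1], (1, 1, 2, 2), (9.0, 9.0))]   # fails cond. 3
location = estimate_joint(candidates, joint=1, heat=heat, k=1)
```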


8. Experiments 

In this paper, we evaluate the proposed method on the Leeds Sports Pose (LSP) dataset [10], the extended LSP dataset [?], and the Frames Labeled in Cinema (FLIC) dataset [?]. LSP and its extension contain 11,000 training and 1,000 testing images of sports people gathered from Flickr with 14 full-body joints annotated. These images are challenging because of different appearances and strong articulation. The images in the LSP dataset have been scaled so that the most prominent person is about 150 pixels high. The FLIC dataset contains 3,987 training and 1,016 testing images from Hollywood movies with 10 upper-body joints annotated. The images in the FLIC dataset contain people with diverse poses and appearances and are biased towards front-facing poses.

Most LSP images only contain a single person. While each image in FLIC may contain multiple people, similar to [?], a standard preprocessing of body detection has been conducted to extract individual persons. As in previous works, we take the subimages of these detected individual persons as training and testing samples. This way, the training and testing data only contain a single person and, as mentioned earlier, in the testing stage we simply take the whole image (for the FLIC dataset, this means a whole subimage for an individual person) as the body patch.

Figure 5. PDJ comparison on FLIC.

Figure 6. PDJ comparison on LSP (panels: LSP - Arms and LSP - Legs).


It has been verified that, in the training stage, the use of object proposals can help train better CNNs for object detection and part localization [17, 18]. However, in the testing stage, object proposals detected on an image may be unevenly distributed. As a result, an image region covered by dense low-likelihood object proposals may undesirably show higher values in the resulting heatmap than a region covered by sparser high-likelihood object proposals. To avoid this issue, in our experiments we use multi-scale sliding windows (with sizes of $0.5\,d(J)$ and $d(J)$, stride 2) to provide part patches in the testing stage.
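The multi-scale sliding windows used at test time can be enumerated as in the following sketch. It assumes square windows whose side lengths are the stated multiples of $d(J)$ rounded to whole pixels, and a stride measured in pixels; both are assumptions of this illustration.

```python
def sliding_windows(img_w, img_h, d, scales=(0.5, 1.0), stride=2):
    """Enumerate square windows (x0, y0, x1, y1) at several scales.

    d is the torso distance d(J); each scale s yields windows of side s * d.
    Windows are kept only if they lie fully inside the image.
    """
    windows = []
    for s in scales:
        side = max(1, round(s * d))
        for y0 in range(0, img_h - side + 1, stride):
            for x0 in range(0, img_w - side + 1, stride):
                windows.append((x0, y0, x0 + side, y0 + side))
    return windows


windows = sliding_windows(img_w=8, img_h=6, d=4)
```

Because the windows form a regular grid, every image region is covered by the same number of patches at each scale, which avoids the uneven coverage that object proposals can produce.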

To compare with previous works, we evaluate the performance of human pose estimation using two popular metrics: Percentage of Correct Parts (PCP) [38] and Percentage of Detected Joints (PDJ) [15, 16]. PCP measures the rate of correct limb detection: a limb is detected if the distances between the detected limb joints and the true limb joints are no more than half of the limb length. Since PCP penalizes short limbs, PDJ is introduced to measure the detection rate of joints, where a joint is considered to be detected if the distance between the detected joint and the true joint is less than a fraction of the torso diameter $d(J)$ as described in Section 4. For PDJ, we can obtain different detection rates by varying the fraction and generate a PDJ curve in terms of the normalized distance to the true joint [?].
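Both metrics follow directly from the definitions above. The sketch below is a minimal illustration with hypothetical function names; it evaluates PDJ for one joint type over a set of images and the PCP criterion for a single limb.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def pdj(pred, truth, torso_diameters, fraction):
    """Percentage of Detected Joints at one normalized-distance threshold.

    pred, truth: predicted and true locations of a joint, one per image.
    torso_diameters: d(J) for each image.
    """
    hits = sum(dist(p, t) < fraction * d
               for p, t, d in zip(pred, truth, torso_diameters))
    return hits / len(pred)


def limb_detected(pred_a, pred_b, true_a, true_b):
    """PCP criterion: both limb end points are within half the limb length."""
    limit = 0.5 * dist(true_a, true_b)
    return dist(pred_a, true_a) <= limit and dist(pred_b, true_b) <= limit


rate = pdj([(0, 0), (10, 0)], [(1, 0), (0, 0)], [10, 10], fraction=0.2)
good_limb = limb_detected((0, 1), (10, 0), (0, 0), (10, 0))
bad_limb = limb_detected((0, 6), (10, 0), (0, 0), (10, 0))
```

Sweeping `fraction` over a range of values and plotting `pdj` against it gives a PDJ curve of the kind compared in the experiments.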

The parameters that need to be set in the proposed method are

1. Lower bound coefficient $\mu_1$ and upper bound coefficient $\mu_2$ in Eq. (3).

2. Balance factor $\lambda_d$ in the loss function in Eq. (8).

3. $k$ and $\lambda_h$ that are used for selecting patch-pairs for joint localization in Section 7.

In our experiments, we set $\mu_1 = 0.1$, $\mu_2 = 1.0$, $\lambda_d = 4$, $k = 3$, and $\lambda_h = 0.9$.

Figure 7. Visualization of the features extracted by layer F7 in DS-CNN.

In this paper, we use the open-source CNN library Caffe [?] for implementing DS-CNN. We finetune a CNN network pretrained on ImageNet [29] for training the proposed DS-CNN. Following [17], the learning rate is initialized to a tenth of the initial ImageNet learning rate and is decreased by a factor of ten after every fixed number of iterations.
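The learning-rate schedule described above is a step decay. A minimal sketch is given below; the base learning rate and the step size in the example are hypothetical values chosen only for illustration, since the concrete numbers are not stated here.

```python
def step_learning_rate(iteration, base_lr, step_size):
    """Step decay: divide the learning rate by ten every step_size iterations."""
    return base_lr * (0.1 ** (iteration // step_size))


# Hypothetical values: a fine-tuning rate of 0.001 (a tenth of an assumed
# pretraining rate of 0.01), decayed every 1000 iterations.
lr_start = step_learning_rate(0, base_lr=0.001, step_size=1000)
lr_later = step_learning_rate(2500, base_lr=0.001, step_size=1000)
```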

We first evaluate our method on the LSP dataset. The PCP of the proposed method, DeepPose and six other comparison methods for head, torso, and four limbs (upper/lower arms and upper/lower legs) is shown in Table 1. Except for 'head', our method outperforms all the comparison methods, including DeepPose, at all body parts. The improvement in average PCP is over 15% against the best result obtained by the comparison methods (0.84 versus 0.72).

Figure 5 shows the PDJ curves of the proposed method and seven comparison methods at the elbows and wrists on the FLIC dataset [?, 15, 33, ?, 11, ?, 21]. We can see that the proposed method outperforms all the comparison methods except for Tompson et al. The PDJ performance of Tompson et al. is higher than that of the proposed method when the normalized distance to the true joint (or, for brevity, the normalized distance) is less than a threshold $t$, but a little lower than that of the proposed method when the normalized distance is larger than $t$. As shown in Fig. 5, the value of $t$ is 0.15 and 0.18 for elbows and wrists, respectively.

As a further note, Tompson et al. [34] combine an MRF-based graphical model with CNN-based part detection. They show that the inclusion of the graphical model can substantially improve the PDJ performance. In this paper, we focus on developing a new CNN-based method to detect local parts without using any high-level graphical models. We believe the PDJ performance of the proposed method can be further improved if we combine it with a graphical model as in Tompson et al.

The performance comparison on the LSP dataset using the PDJ
metric is shown in Fig.[^ Similar to the PDJ comparison on FLIC,
the PDJ of the proposed method is better than that of all the
comparison methods except for Tompson et al. When compared
with Tompson et al., the proposed method performs
substantially better when the normalized distance is large and
worse when the normalized distance is small. One observation
is that the PDJ gain of the proposed method over Tompson
et al. at large normalized distances is more significant on LSP
than on FLIC.

LSP       ankle  knee   hip  wrist  elbow  shoulder  neck  head   mAP
p_p        35.7  25.5  27.3   20.7   17.1      35.0  47.9  70.3  31.5
p_b        39.7  39.6  37.5   21.3   29.3      40.7  44.4  70.4  37.9
p_{p,b}    44.6  41.9  41.8   30.4   34.2      48.7  58.9  79.6  44.4

FLIC        hip  wrist  elbow  shoulder  head   mAP
p_p        61.2   56.0   71.2      88.8  93.8  72.0
p_b        72.8   59.3   77.7      91.0  94.0  77.2
p_{p,b}    74.3   68.1   82.0      93.5  96.4  81.4

Table 2. Average precision (%) of joint detection on the LSP and FLIC testing datasets when the CNN takes different types of patches as input.

Figure 8. Human pose estimation on sample images from the FLIC and LSP testing datasets.

We also conduct an experiment to verify the effectiveness
of using dual sources of inputs: p_p and p_b. In this
experiment, we compute the average precision (AP) of the
joint detection when taking either 1) only part patches p_p,
2) only body patches p_b, or 3) the proposed patch pairs
p_{p,b} as the input to the CNN. The results are shown in
Table 2. On both the LSP and FLIC testing datasets, the use
of the dual-source patch-pairs achieves better AP at all
joints, and the best mAP, the average AP over all the joints.
Note that the body patch p_b in this paper actually includes
part patch information, in the form of a binary mask as
discussed in Section 1^ That is why the use of only p_b can
lead to significantly better AP than the use of only p_p on
both the LSP and FLIC testing datasets. However, the binary
mask is usually of very low resolution because we normalize
the body patch to a fixed dimension. As a result, we still
need to combine p_p and p_b and construct a dual-source
CNN for pose estimation.
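
As a small worked check of how mAP relates to the per-joint AP values, the sketch below averages the patch-pair AP values reported in Table 2. It rests on an assumption that is not stated in the text: neck and head are treated as single joints, and every other column is treated as a left/right pair sharing the reported AP. Under that reading, the averages match the reported mAP of 44.4 (LSP) and 81.4 (FLIC). The names `mean_ap` and `SINGLE_JOINTS` are hypothetical.

```python
# Per-joint AP (%) for the patch-pair input p_{p,b}, copied from Table 2.
LSP_AP = {"ankle": 44.6, "knee": 41.9, "hip": 41.8, "wrist": 30.4,
          "elbow": 34.2, "shoulder": 48.7, "neck": 58.9, "head": 79.6}
FLIC_AP = {"hip": 74.3, "wrist": 68.1, "elbow": 82.0,
           "shoulder": 93.5, "head": 96.4}

# Assumption: neck and head are single joints; every other column stands
# for a left/right pair of joints that share the reported AP.
SINGLE_JOINTS = {"neck", "head"}


def mean_ap(ap_by_joint):
    """Average AP over all joints, counting paired joints twice."""
    total, count = 0.0, 0
    for joint, ap in ap_by_joint.items():
        weight = 1 if joint in SINGLE_JOINTS else 2
        total += weight * ap
        count += weight
    return total / count


print(round(mean_ap(LSP_AP), 1), round(mean_ap(FLIC_AP), 1))
```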

Following ifTTl [40], we visualize the patterns extracted
by DS-CNN. We compute the activations of each hidden
node in layer F_7 on a set of patch-pairs, and Figure 7 shows
several patch pairs with the largest activations in the first
node of F_7. We can see that this node fires for two pose
patterns: the bent right elbow and the right hip. For each
pattern, the corresponding full-body poses also show high
similarity because of the inclusion of both part and body
patches in DS-CNN.
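
The selection step behind this visualization, picking the patch-pairs that most strongly activate a given hidden node, can be sketched as below. This is an illustration only: the function `top_activating_pairs` and the activation values are made up, and extracting real F_7 activations from the trained network is outside the sketch.

```python
def top_activating_pairs(activations, node, k):
    """Indices of the k patch-pairs with the largest activation at `node`.

    `activations[i][j]` is the response of hidden node j to patch-pair i.
    """
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i][node],
                    reverse=True)
    return ranked[:k]


# Toy responses of a 3-node layer to five patch-pairs (made-up numbers).
activations = [
    [0.1, 0.9, 0.0],
    [2.3, 0.2, 0.4],
    [0.7, 0.1, 1.5],
    [1.8, 0.0, 0.3],
    [0.0, 0.4, 0.2],
]

# Patch-pairs that fire the first node most strongly, strongest first.
print(top_activating_pairs(activations, node=0, k=3))
```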


Finally, sample human pose estimation results on the FLIC
and LSP testing datasets are shown in Fig. 8. In general,
upper-body poses in FLIC are usually front-facing, while
the full-body poses in LSP include many complex poses. As
a result, human pose estimation on LSP is less accurate than
that on FLIC. By including holistic views of part patches,
the proposed method can estimate the human pose even if
some joints are occluded, as shown in Fig.[^

9. Conclusion 

In this paper, we developed a new human pose estimation
method based on a dual-source convolutional neural
network (DS-CNN), which takes two kinds of patches
(part patches and body patches) as inputs to combine both
local and contextual information for more reliable pose
estimation. In addition, the output of DS-CNN is designed
for both joint detection and joint localization, which are
combined for estimating the human pose. By testing on
the FLIC and LSP datasets, we found that the proposed
method achieves better performance than several
existing methods. When compared with Tompson et al. [34],
the proposed method performs better when the normalized
distance is large and worse when the normalized
distance is small. The proposed method is implemented
using the open-source CNN library Caffe and is therefore
readily extensible.
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